Ontologies in the Time of Linked Data

Authors

  • Hilary Thorsen
  • Cristina Pattuelli
Abstract

This paper discusses some of the methodological issues one encounters when creating and using ontologies in the rapidly expanding Linked Open Data (LOD) landscape. Over the years, the notion of applied ontologies has transitioned from that of a logically formalized knowledge system with varying degrees of inferencing power to that of a lightweight knowledge representation tool. This shift is reflected in the current lexicon, where different actors in the LOD community use the term ontology interchangeably with more generic terms like vocabulary [1] or even namespace or data schema. Applied ontologies have been a key area of research in the context of the Semantic Web initiative since the late 1990s. The Semantic Web has recently found a new stream of development in the Linked Data initiative, which is considered its natural evolution (Allemang and Hendler, 2011). While a good deal of literature has been devoted to investigating ontology engineering for the Semantic Web, not enough attention has yet been paid to understanding the nature and role of ontologies in the linked data context, especially through the lens of knowledge organization research. Based on our ongoing work creating Linked Open Data applications and services for digital resources in the domain of the performing arts, we describe methodological steps and lessons learned in line with the spirit of the linked data initiative, where an agile and pragmatic approach to development is combined with the practice of learning from one another.

BACKGROUND

LOD is a W3C and community-based initiative working on extending the Web as we know it by meaningfully connecting data from heterogeneous sources. The central idea of linked data is achieved by making data processable by machines and seamlessly accessible, using the Web itself as a unifying discovery space.
In an effort to realize the “Web of Data”, as Tim Berners-Lee defines the LOD environment (Bizer, Heath, and Berners-Lee, 2009), a great number of LOD datasets have been created and made freely available for sharing, reuse and interlinking. A visualization of the LOD ecosystem is provided by the LOD cloud [2], showing the dense interlinking between heterogeneous sets of data that continues to grow exponentially in different domains. Ontologies form the backbone of this linked data environment and are key to supporting its open and distributed infrastructure. To help understand the nature and role of ontologies in this emerging information context, we begin by describing the technological framework that supports LOD development.

[1] In the context of this paper, the terms ontology and vocabulary are also used interchangeably.
[2] http://lod-cloud.net/

LOD TECHNOLOGY STACK

The LOD infrastructure relies on a rather small set of existing open standards that are deeply ingrained in the fabric of the Web: a naming standard to uniquely identify resources, the URI (Uniform Resource Identifier), and the common Hypertext Transfer Protocol (HTTP). The RDF (Resource Description Framework) model serves as the unifying platform to represent and exchange data (Berners-Lee, 2009). In other words, RDF is the common framework for representing resources (Schreiber and Raimond, 2014). In the context of LOD, a resource is anything that can be identified or named, including any object, event or unit of information that can be referenced by a URI. The basic building blocks of the LOD infrastructure are simple statements, called RDF triples, composed of three atomic elements: a subject, a predicate and an object. Each element of a triple is paired with a URI that performs a referential function, making it both human readable and machine processable.
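As a minimal sketch of the triple structure just described, the following Python fragment models a statement as three URI strings and prints it in N-Triples syntax. The `Triple` class and the example.org URIs are illustrative assumptions and do not come from any published vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; each element is held as a URI string."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        # N-Triples wraps each URI in angle brackets and ends the line with " ."
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


# Hypothetical example.org URIs, used only for illustration.
statement = Triple(
    subject="http://example.org/resource/ResourceA",
    predicate="http://example.org/vocab/relatedTo",
    obj="http://example.org/resource/ResourceB",
)

print(statement.to_ntriples())
```

Holding each element as a URI rather than a local label is what lets statements from independent sources refer to the same resource.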
Because of this global naming convention, any object of an RDF triple can become the subject of another triple, creating chains of relationships and representing the information space as a graph or network. As a consequence, the multitude of ontologies that populate the LOD ecosystem all share the same underlying semantics made up of relatively simple modeling constructs. The implications of relying on open standards and a common data model to facilitate data interoperability and integration are far-reaching.

ONTOLOGY DEVELOPMENT METHODOLOGY

The process of building applied ontologies has been addressed rather extensively, especially within the framework of the Semantic Web (Noy and McGuinness, 2001; Gómez-Pérez, Corcho-García and Fernández-López, 2003). In the context of linked data development, methodological guidelines for creating and using RDF-based ontologies have yet to be established. However, best practices are emerging through shared documentation and lessons learned. Most recently, Villazón-Terrazas et al. (2011) proposed a six-stage methodology that consists of: 1) specification, 2) modeling, 3) generation, 4) linking, 5) publication, and 6) exploitation. We will use these sequential steps as a general framework to address the process of building and using an ontology in the context of the Linked Jazz Project, which provides a real-world application scenario.

ONTOLOGY SCOPE AND PURPOSE

The primary goal of Linked Jazz [3] is to leverage linked data principles and technologies to uncover the dense web of relationships between artists in the jazz community. The project relies on transcriptions of oral histories to identify relevant entities (jazz musicians) as well as the professional and personal relationships that occur among them (Pattuelli, Miller, Lange, Fitzell, and Li-Madeo, 2013).

[3] https://linkedjazz.org/

While still evolving and expanding to new areas of cultural heritage, Linked Jazz has delivered a rich LOD dataset representing over 9,000 artist entities and their connections. These connections are assigned a specific relationship type. The source of the relationship is the occurrence of a mention in the transcript text. In other words, whenever the subject of an oral history mentions someone, a triple is created that expresses a claim that this individual knows of the person they cite. For example, the claim that Sam Rivers (subject) knows of (predicate) Dizzy Gillespie (object) is represented by the triple below:

[Triple: Sam Rivers (subject), rel:knowsOf (predicate), Dizzy Gillespie (object)]

SPECIFICATION AND MODELING

The specification of the content domain was based on the analysis of our data sources, oral histories in the field of jazz history. Jazz artists were identified as the primary entities. A small set of classes and properties was required to model these entities and their properties. One type of relationship (rel:knowsOf) was sufficient to describe the connections among musicians as derived from the data sources (Figure 1). The modeling process was driven by one of the main principles and established practices in LOD development: the reuse of existing and publicly available LOD semantics. Adopting terms from existing RDF ontologies whenever possible is considered a powerful way to make it easier for applications to process and integrate linked data (Heath and Bizer, 2011).

Figure 1. Diagram of core set of classes and properties.

As the project progressed and more semantic complexity was required to enrich the original set of data with new layers of meaning, the need arose to integrate the original core with a broader range of personal and professional relationships. More specifically, the nature of the basic connections held by musicians was further specified and their semantics assigned using a crowdsourcing approach (Pattuelli, Miller, Lange, and Thorsen, 2013).
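The mention-based rule described earlier in this section, in which each person cited in an oral history yields a knows-of claim, could be sketched as follows. This is an illustration rather than the project's actual pipeline: the example.org artist URIs and the helper names are assumptions, and `rel:` is assumed to abbreviate the Relationship Vocabulary namespace `http://purl.org/vocab/relationship/`.

```python
# Assumed expansion of rel:knowsOf in the Relationship Vocabulary namespace.
REL_KNOWS_OF = "http://purl.org/vocab/relationship/knowsOf"

# Hypothetical base URI; these are not the identifiers used by Linked Jazz.
ARTIST_BASE = "http://example.org/artist/"


def artist_uri(name: str) -> str:
    """Build an illustrative URI for an artist from a display name."""
    return ARTIST_BASE + name.replace(" ", "_")


def knows_of_triples(interviewee: str, mentions: list[str]) -> list[tuple[str, str, str]]:
    """Create one knowsOf triple per distinct person mentioned by the interviewee."""
    subject = artist_uri(interviewee)
    triples: list[tuple[str, str, str]] = []
    for name in mentions:
        candidate = (subject, REL_KNOWS_OF, artist_uri(name))
        # Skip self-mentions and repeated mentions of the same person.
        if name != interviewee and candidate not in triples:
            triples.append(candidate)
    return triples


# The Sam Rivers / Dizzy Gillespie pair is the example given in the text;
# the repeated mention shows that duplicates collapse into a single claim.
triples = knows_of_triples("Sam Rivers", ["Dizzy Gillespie", "Dizzy Gillespie"])
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

Collapsing repeated mentions into one triple is a design choice made for this sketch; a pipeline that needs mention counts would have to record them separately.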
Through a dedicated platform [4], different types of personal and professional relationships are manually contributed by volunteers with the goal of enriching the core Linked Jazz dataset with accurate and granular predicates. Crowd annotations are automatically mapped to a set of predicates derived from existing LOD vocabularies. These predicates, taken from the Relationship Vocabulary and the Music Ontology, include rel:knowsOf, rel:hasMet, rel:acquaintanceOf, rel:closeFriendOf, rel:influencedBy, mo:collaborated_with, and rel:mentorOf.

The capability to easily extend our evolving data model is an important trait of the modeling practices in the LOD context. Elements from different vocabularies are easily integrated into an existing ontology through the process of mixing and matching. Multiple vocabularies can serve at once as sources of semantics and enrich an ontology in a layered fashion without the need for community agreement for adoption. Properties with overlapping scope can coexist without hampering the consistency of the schema. Again, the openness and decentralized nature of the RDF framework are key to enabling the inclusion of terms from external sources without the need for coordination or formal agreements on the adoption of a specific schema.

Both reuse and extensibility are distinctive traits of LOD ontologies that mark a clear departure from traditional computational ontologies. LOD ontologies are characterized by several features that facilitate the reuse of elements from existing RDF vocabularies and their integration into a target knowledge system. First, they are lightweight in terms of their level of formality and typically small in size, making them easy to manage and maintain.
Second, they rely on well-established, W3C-governed representation systems, including SKOS (Simple Knowledge Organization System), RDFS (RDF Schema), and OWL (Web Ontology Language), as reference models and sources of essential concepts (e.g., owl:Thing usually serves as the ontology root). Having their basic representation framework grounded in sound and widely adopted standard vocabularies has the benefit of enforcing stability and facilitating interoperability and adoption by service providers and users. This high degree of flexibility is especially suitable for supporting the dynamic nature of the evolving web of linked data. As discussed earlier, the openness of the management and discovery environment (the Web itself), the lightweight nature of LOD standards, and the unifying role played by the RDF platform are key to enabling terms from published vocabularies to be easily included in an existing ontology. Rather than duplicating effort by defining concepts and properties that have already been defined elsewhere, we can point to the selected elements in an existing ontology via a URI. For example, the predicates that describe the personal and professional relationships in our extended dataset are derived from the Music Ontology [5] as well as from the Relationship Vocabulary [6]. To this end, the identification of suitable vocabularies as sources of reusable semantics represents an important methodological step in the modeling stage of the ontology building process.

[4] https://linkedjazz.org/52ndStreet/

VOCABULARY SELECTION

As part of the acquisition process, deciding which LOD ontologies to reuse can be challenging. As the LOD initiative has grown exponentially in the last few years, RDF-based ontologies have proliferated, creating considerable overlap between them. Some of these are general-purpose ontologies published by trusted and stable bodies, such as the W3C RDF Concepts Vocabulary [7] and the Geospatial Vocabulary [8], as well as the Dublin Core Ontology [9].
These general ontologies typically describe entities such as people, organizations, events, and geographic locations. They serve as reference vocabularies and “anchors” for designing a conceptual model. DBpedia’s ontology [10] is particularly useful for its cross-domain and extensive thematic coverage. Created to map the massive amounts of data extracted from Wikipedia, the DBpedia ontology has become a de facto reference vocabulary due to the popularity of the DBpedia dataset in the LOD landscape. Libraries (including the Library of Congress), museums (especially the International Council of Museums, which developed CIDOC-CRM [11], or the International Committee for Documentation’s Conceptual Reference Model), and archives (which created the EAC-CPF, or the Encoded Archival Context-Corporate Bodies, Persons, and Families XML schema) [12] have also published RDF ontologies to reflect the representational needs of the communities that create them. Some ontologies originally developed by individuals as independent projects have gained popularity through grassroots adoption. This is the case of FOAF [13] (Friend of a Friend), which contains commonly used predicates for basic descriptions of people, and the Relationship Vocabulary, which provides predicates for describing relationships between people. While these vocabularies offer general terms, it is often necessary to include more granular terms from domain-specific ontologies, such as the Music Ontology, to enrich their expressivity (Raimond, Abdallah, Sandler and Giasson,

[5] http://musicontology.com/specification/
[6] http://vocab.org/relationship/.html
[7] http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
[8] http://www.w3.org/2005/Incubator/geo/XGR-geo/
[9] http://dublincore.org/documents/dcmi-terms/
[10] http://mappings.dbpedia.org/server/ontology/classes/
[11] http://cidoc-crm.org/docs/cidoc_crm_version_5.0.4.pdf
[12] http://labs.regesta.com/progettoReload/wp-content/uploads/2013/10/eac-cpf.html
[13] http://xmlns.com/foaf/spec/
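The practice of drawing predicates from more than one published vocabulary, here the Relationship Vocabulary and the Music Ontology, can be sketched as a prefix table plus a lookup from crowd annotation labels to predicates. The predicate names are those listed in the text. The annotation labels, the helper functions, and the fallback to rel:knowsOf are assumptions made for illustration, as are the namespace URIs assumed for the `rel:` and `mo:` prefixes.

```python
# Assumed namespace URIs for the two prefixes used in the text.
PREFIXES = {
    "rel": "http://purl.org/vocab/relationship/",
    "mo": "http://purl.org/ontology/mo/",
}

# Hypothetical annotation labels; only the predicate names come from the text.
ANNOTATION_TO_PREDICATE = {
    "met": "rel:hasMet",
    "acquaintance": "rel:acquaintanceOf",
    "close friend": "rel:closeFriendOf",
    "influenced by": "rel:influencedBy",
    "played together": "mo:collaborated_with",
    "mentor": "rel:mentorOf",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'rel:hasMet' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


def predicate_for(label: str) -> str:
    """Map an annotation label to a predicate URI.

    Falls back to the generic rel:knowsOf when no finer-grained
    predicate applies (an assumption of this sketch).
    """
    curie = ANNOTATION_TO_PREDICATE.get(label.strip().lower(), "rel:knowsOf")
    return expand(curie)


print(predicate_for("Played together"))
print(predicate_for("unlisted label"))
```

Because every predicate resolves to a URI, terms from both vocabularies can sit side by side in one dataset without any change to either source vocabulary.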




Publication date: 2015